Khmer Word Segmentation and Out-of-Vocabulary Words Detection Using Collocation Measurement of Repeated Characters Subsequences
Similar resources
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...
Which is More Suitable for Chinese Word Segmentation, the Generative Model or the Discriminative One?
Since the traditional word-based n-gram model, a generative approach, cannot handle out-of-vocabulary (OOV) words in the test set, the character-based discriminative approach has been widely adopted recently. However, this discriminative model, though more robust to OOV words, fails to deliver satisfactory performance for those in-vocabulary (IV) words that have been observed before...
Recurrent Out-of-Vocabulary Word Detection Using Distribution of Features
The repeated use of out-of-vocabulary (OOV) words in a spoken document seriously degrades a speech recognizer’s performance. This paper provides a novel method for accurately detecting such recurrent OOV words. Standard OOV word detection methods classify each word segment into in-vocabulary (IV) or OOV. This word-by-word classification tends to be affected by sudden vocal irregularities in spo...
An Effective Character Separation Method for Online Cursive Uyghur Handwriting
There are many connected characters in cursive Uyghur handwriting, which makes the segmentation and recognition of Uyghur words very difficult. To enable large vocabulary Uyghur word recognition using character models, we propose a character separation method for over-segmentation in online cursive Uyghur handwriting. After removing delayed strokes from the handwritten words, potential breakpoi...